Data reshaping of long and wide data

Introduction

Welcome to a tutorial on reshaping data in R using the dplyr and tidyr packages. In this tutorial, you’ll learn how to convert data between wide and long formats, understand when each format is useful, and see how to perform calculations more efficiently with long-format data. We will use examples to illustrate these concepts, focusing on economic data such as GDP components for multiple countries.

Prerequisites

Before diving into the analysis, let’s load the necessary R packages. These packages will help us manipulate data efficiently.

library(dplyr)    # For data manipulation

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)    # For reshaping data
library(here)     # For building file paths
here() starts at C:/IMF-R-Book

Understanding Wide and Long Formats

The same dataset can be written in two different formats: wide and long.

Wide Format

In a wide format, values do not repeat in the first column. This format works well if you have a time series of one variable for various countries, as each country’s data for different years can be spread across multiple columns.

Example of wide data:

wide_data <- data.frame(
  year = c(2016, 2017, 2018, 2019),
  consumption_BRA = c(5000, 5100, 5200, 5300),
  consumption_MEX = c(4000, 4100, 4200, 4300),
  consumption_USA = c(15000, 15500, 16000, 16500),
  investment_BRA = c(1500, 1550, 1600, 1650),
  investment_MEX = c(1200, 1250, 1300, 1350),
  investment_USA = c(5000, 5100, 5200, 5300),
  exports_BRA = c(1000, 1100, 1200, 1300),
  exports_MEX = c(2000, 2100, 2200, 2300),
  exports_USA = c(6000, 6200, 6400, 6600),
  imports_BRA = c(800, 850, 900, 950),
  imports_MEX = c(1000, 1050, 1100, 1150),
  imports_USA = c(3000, 3100, 3200, 3300)
)

wide_data
  year consumption_BRA consumption_MEX consumption_USA investment_BRA
1 2016            5000            4000           15000           1500
2 2017            5100            4100           15500           1550
3 2018            5200            4200           16000           1600
4 2019            5300            4300           16500           1650
  investment_MEX investment_USA exports_BRA exports_MEX exports_USA imports_BRA
1           1200           5000        1000        2000        6000         800
2           1250           5100        1100        2100        6200         850
3           1300           5200        1200        2200        6400         900
4           1350           5300        1300        2300        6600         950
  imports_MEX imports_USA
1        1000        3000
2        1050        3100
3        1100        3200
4        1150        3300

Long Format

In a long format, values do repeat in the first column. This format is more efficient when you have multiple variables for different countries, as it allows for easier manipulation and analysis.

Example of long data:

long_data <- data.frame(
  year = rep(2016:2019, each = 3),
  country = rep(c("BRA", "MEX", "USA"), times = 4),
  consumption = c(5000, 4000, 15000, 5100, 4100, 15500, 5200, 4200, 16000, 5300, 4300, 16500),
  investment = c(1500, 1200, 5000, 1550, 1250, 5100, 1600, 1300, 5200, 1650, 1350, 5300),
  exports = c(1000, 2000, 6000, 1100, 2100, 6200, 1200, 2200, 6400, 1300, 2300, 6600),
  imports = c(800, 1000, 3000, 850, 1050, 3100, 900, 1100, 3200, 950, 1150, 3300)
)
long_data
   year country consumption investment exports imports
1  2016     BRA        5000       1500    1000     800
2  2016     MEX        4000       1200    2000    1000
3  2016     USA       15000       5000    6000    3000
4  2017     BRA        5100       1550    1100     850
5  2017     MEX        4100       1250    2100    1050
6  2017     USA       15500       5100    6200    3100
7  2018     BRA        5200       1600    1200     900
8  2018     MEX        4200       1300    2200    1100
9  2018     USA       16000       5200    6400    3200
10 2019     BRA        5300       1650    1300     950
11 2019     MEX        4300       1350    2300    1150
12 2019     USA       16500       5300    6600    3300

Reshaping Data

Let’s start by reshaping a dataset from wide to long format. We’ll use the example dataset that contains GDP components (consumption, investment, exports, and imports) for multiple countries.

Wide to Long

To reshape the wide data data into long format with columns year, country, consumption, investment, exports, and imports, we’ll use the pivot_longer() function from the tidyr package:

long_data <- wide_data %>%
  pivot_longer(cols = -year,
               names_to = c("variable", "country"),
               names_sep = "_",
               values_to = "value") %>%
  pivot_wider(names_from = "variable", values_from = "value") %>%
  arrange(country,year)
long_data
# A tibble: 12 × 6
    year country consumption investment exports imports
   <dbl> <chr>         <dbl>      <dbl>   <dbl>   <dbl>
 1  2016 BRA            5000       1500    1000     800
 2  2017 BRA            5100       1550    1100     850
 3  2018 BRA            5200       1600    1200     900
 4  2019 BRA            5300       1650    1300     950
 5  2016 MEX            4000       1200    2000    1000
 6  2017 MEX            4100       1250    2100    1050
 7  2018 MEX            4200       1300    2200    1100
 8  2019 MEX            4300       1350    2300    1150
 9  2016 USA           15000       5000    6000    3000
10  2017 USA           15500       5100    6200    3100
11  2018 USA           16000       5200    6400    3200
12  2019 USA           16500       5300    6600    3300

Long to Wide

If we want to reshape the long data back into wide format, we can use the pivot_wider() function:

wide_data_again <- long_data %>%
  pivot_wider(names_from = "country",
              values_from = c("consumption", "investment", "exports", "imports"))
wide_data_again
# A tibble: 4 × 13
   year consumption_BRA consumption_MEX consumption_USA investment_BRA
  <dbl>           <dbl>           <dbl>           <dbl>          <dbl>
1  2016            5000            4000           15000           1500
2  2017            5100            4100           15500           1550
3  2018            5200            4200           16000           1600
4  2019            5300            4300           16500           1650
# ℹ 8 more variables: investment_MEX <dbl>, investment_USA <dbl>,
#   exports_BRA <dbl>, exports_MEX <dbl>, exports_USA <dbl>, imports_BRA <dbl>,
#   imports_MEX <dbl>, imports_USA <dbl>

Why Long Format is More Efficient for Calculations

Let’s consider an example where we need to calculate GDP for multiple countries using the components (consumption, investment, exports, and imports).

Wide Format Calculation

In wide format, you would need to perform the calculation for each year and each country separately:

wide_data <- wide_data %>%
  mutate(gdp_BRA = consumption_BRA + investment_BRA + exports_BRA - imports_BRA,
         gdp_MEX = consumption_MEX + investment_MEX + exports_MEX - imports_MEX,
         gdp_USA = consumption_USA + investment_USA + exports_USA - imports_USA)
wide_data
  year consumption_BRA consumption_MEX consumption_USA investment_BRA
1 2016            5000            4000           15000           1500
2 2017            5100            4100           15500           1550
3 2018            5200            4200           16000           1600
4 2019            5300            4300           16500           1650
  investment_MEX investment_USA exports_BRA exports_MEX exports_USA imports_BRA
1           1200           5000        1000        2000        6000         800
2           1250           5100        1100        2100        6200         850
3           1300           5200        1200        2200        6400         900
4           1350           5300        1300        2300        6600         950
  imports_MEX imports_USA gdp_BRA gdp_MEX gdp_USA
1        1000        3000    6700    6200   23000
2        1050        3100    6900    6400   23700
3        1100        3200    7100    6600   24400
4        1150        3300    7300    6800   25100

This approach becomes cumbersome if you have many countries and years.

Long Format Calculation

In long format, you can perform the calculation more efficiently:

gdp_data <- long_data %>%
   mutate(gdp = consumption + investment + exports - imports) 
  
gdp_data
# A tibble: 12 × 7
    year country consumption investment exports imports   gdp
   <dbl> <chr>         <dbl>      <dbl>   <dbl>   <dbl> <dbl>
 1  2016 BRA            5000       1500    1000     800  6700
 2  2017 BRA            5100       1550    1100     850  6900
 3  2018 BRA            5200       1600    1200     900  7100
 4  2019 BRA            5300       1650    1300     950  7300
 5  2016 MEX            4000       1200    2000    1000  6200
 6  2017 MEX            4100       1250    2100    1050  6400
 7  2018 MEX            4200       1300    2200    1100  6600
 8  2019 MEX            4300       1350    2300    1150  6800
 9  2016 USA           15000       5000    6000    3000 23000
10  2017 USA           15500       5100    6200    3100 23700
11  2018 USA           16000       5200    6400    3200 24400
12  2019 USA           16500       5300    6600    3300 25100

Using long format, the calculation is more straightforward and scalable, especially when dealing with large datasets.

Conclusion

In this tutorial, you’ve learned how to reshape data between wide and long formats in R using the dplyr and tidyr packages. We’ve discussed the benefits of each format and demonstrated how long format can simplify complex calculations. These techniques are essential for efficient data manipulation and analysis in R.